"Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models

Authors

  • Philippe Schwaller
  • Theophile Gaudin
  • David Lanyi
  • Constantine Bekas
  • Teodoro Laino
Abstract

There is an intuitive analogy between an organic chemist's understanding of a compound and a language speaker's understanding of a word. Consequently, it is possible to introduce the basic concepts of linguistic analysis to the world of organic chemistry and to analyze their potential impact. In this work, we cast the reaction prediction task as a translation problem by introducing a template-free sequence-to-sequence model, trained end-to-end and fully data-driven. We propose a novel way of tokenization, which is arbitrarily extensible with reaction information. With this approach, we demonstrate top-1 accuracy superior to the state-of-the-art solution by a significant margin. Specifically, our approach achieves an accuracy of 80.3% without relying on auxiliary knowledge such as reaction templates. On a larger and noisier dataset, it reaches 65.4% accuracy.
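As a rough illustration of what casting reaction prediction as translation means in practice, the sketch below splits SMILES strings into atom-level tokens so that reactants form a "source sentence" and the product a "target sentence". It is a minimal sketch under stated assumptions, not the tokenization proposed in the paper: the regular expression is a pattern commonly used for SMILES, the names `SMILES_TOKEN_PATTERN` and `tokenize_smiles` are hypothetical, and the paper's extension of tokens with reaction information is not reproduced here.

```python
import re
from typing import List

# Assumption: a widely used atom-wise SMILES pattern, not taken from the abstract.
# Bracket atoms and two-letter halogens are matched before single characters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Refuse inputs the pattern cannot cover completely, so that no
    # characters are silently dropped from the sequence.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES string: {smiles!r}")
    return tokens


# Illustrative esterification: acetic acid + ethanol -> ethyl acetate.
source_tokens = tokenize_smiles("CC(=O)O.OCC")  # reactants, "source sentence"
target_tokens = tokenize_smiles("CC(=O)OCC")    # product, "target sentence"

# Sequence-to-sequence toolkits typically consume space-separated tokens.
print(" ".join(source_tokens))
print(" ".join(target_tokens))
```

A sequence-to-sequence model would then be trained on such source/target token pairs in the same way a translation model is trained on sentence pairs.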

Similar resources

Linking the Neural Machine Translation and the Prediction of Organic Chemistry Reactions

Finding the main product of a chemical reaction is one of the important problems of organic chemistry. This paper describes a method of applying a neural machine translation model to the prediction of organic chemical reactions. In order to translate 'reactants and reagents' to 'products', a gated recurrent unit based sequence-to-sequence model and a parser to generate input tokens for model fr...

Rhombellanic Crystals and Quasicrystals

The design of some crystal and quasicrystal networks, based on rhombellane tiling, is presented. [1.1.1]Propellane is a synthesized organic molecule; its hydrogenated form, bicyclo[1.1.1]pentane, may be represented by the complete bipartite graph K2,3, which is the smallest rhombellane. Topology of translational and radial structures involving rhombellanes is described in terms of vertex symbol, c...

Improving Phoneme Sequence Recognition using Phoneme Duration Information in DNN-HSMM

Improving phoneme recognition has attracted the attention of many researchers due to its applications in various fields of speech processing. Recent research shows that using deep neural networks (DNNs) in speech recognition systems significantly improves their performance. DNN-based phoneme recognition systems have two phases: training and testing. Mos...

The modified recombinant proinsulin: a simple and efficient route to produce insulin glargine in E. coli

Background: Recombinant insulin glargine, a long-acting analogue of insulin, is expressed as proinsulin in the host cell and, after purification and refolding steps, cleaved to active insulin by enzymatic digestion using trypsin and carboxypeptidase B. Since the proinsulin's B and C chains have several internal arginine and lysine residues, a number of impurities are generated following treatment wit...

Neural networks for the prediction of organic chemistry reactions

Reaction prediction remains one of the great challenges for organic chemistry. Solving this problem computationally requires the programming of a vast amount of knowledge and intuition of the rules of organic chemistry and the development of algorithms for their application. It is desirable to develop algorithms that, like humans, "learn" from being exposed to examples of the application of the...

Journal title:
  • CoRR

Volume abs/1711.04810  Issue

Pages  -

Publication date 2017